Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622

Merged
merged 1 commit into from
Jun 27, 2021

Conversation

jimexist
Copy link
Member

Which issue does this PR close?

Closes #621

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jun 26, 2021
@houqp
Copy link
Member

houqp commented Jun 26, 2021

should we update physical column name as well for better readability in the query output? https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/planner.rs#L137-L139

@jimexist
Copy link
Member Author

jimexist commented Jun 26, 2021

should we update physical column name as well for better readability in the query output? https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/planner.rs#L137-L139

i think this is more of the strategy on how we name unnamed columns in the schema. for window functions without an alias, it'll be created with a unique name, given the function type, arg name, partition by clause, order by clause, and window frames. any of these are equivalent then the columns can be thought of the same and thus deduplicated in the projection phase, otherwise they will be planned separately.

if any such columns are having a name conflict in the projection phase the schema validation logic will fail.

@jimexist
Copy link
Member Author

Postgres is totally okay with dup names in the projection phase:

[postgres] # select max(c1) over (), max(c1) over (), max(c1) over (partition by c1) from test limit 3;
 max | max | max
-----+-----+-----
 e   | e   | a
 e   | e   | a
 e   | e   | a
(3 rows)

@jimexist
Copy link
Member Author

if we are okay with dup names then i guess just keeping the function name would be enough

@houqp
Copy link
Member

houqp commented Jun 26, 2021

We do allow query outputs (i.e. record batch schema and physical plan schema) to contain columns with same names. This is quite common in join queries where more than one relations contain the same column name.

My previous comment was more around the inconsistency on how window function names are generated in logical plane v.s. physical plane. For example, in physical plane, we don't take partition by clause, order by clause, and window frames into account when creating the column name. If we don't want to maintain this consistency, I think we should update this invariant instead: https://github.com/apache/arrow-datafusion/blob/master/docs/specification/invariants.md#the-physical-schema-is-invariant-under-planning.

https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md contains some rules on how we want to generate field names for physical schema. We don't have any rules for window functions at the moment, so maybe worth formalize it in that document as well.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@alamb alamb merged commit 27dc5d6 into apache:master Jun 27, 2021
@jimexist
Copy link
Member Author

We do allow query outputs (i.e. record batch schema and physical plan schema) to contain columns with same names. This is quite common in join queries where more than one relations contain the same column name.

My previous comment was more around the inconsistency on how window function names are generated in logical plane v.s. physical plane. For example, in physical plane, we don't take partition by clause, order by clause, and window frames into account when creating the column name. If we don't want to maintain this consistency, I think we should update this invariant instead: https://github.com/apache/arrow-datafusion/blob/master/docs/specification/invariants.md#the-physical-schema-is-invariant-under-planning.

https://github.com/apache/arrow-datafusion/blob/master/docs/specification/output-field-name-semantic.md contains some rules on how we want to generate field names for physical schema. We don't have any rules for window functions at the moment, so maybe worth formalize it in that document as well.

Thanks for the advice let me address in another pull request

@jimexist jimexist deleted the fix-window-unamed branch June 27, 2021 15:25
@houqp houqp added the bug Something isn't working label Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unqualified window functions shall differentiate given partition and order by clause
3 participants